Day 2

Uroš Godnov

String manipulation

Putting strings together with stringr

  • str_c()
  • the c is short for concatenate; the function works like paste()
[1] "Beautiful day"
[1] NA
[1] "Beautiful NA"
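The three outputs above can be reproduced with a sketch like the following (the input strings are assumptions; the point is the behavior): str_c() propagates NA, while paste() coerces it to the text "NA".

```r
library(stringr)

str_c("Beautiful", "day", sep = " ")   # "Beautiful day"
str_c("Beautiful", NA, sep = " ")      # NA: str_c() propagates missing values
paste("Beautiful", NA)                 # "Beautiful NA": paste() coerces NA to text
```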

Length

[1] 5 5
[1] 5 5
Error in nchar(f): 'nchar()' requires a character vector
[1] 4 4 8 3
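A sketch consistent with the outputs above (the example vectors are assumptions; the last output presumably came from a different four-element vector): str_length() and nchar() agree on character vectors, but nchar() fails on factors.

```r
library(stringr)

people <- c("Bruce", "Wayne")
str_length(people)        # 5 5
nchar(people)             # 5 5

f <- factor(c("cat", "dog"))
# nchar(f)                # error: 'nchar()' requires a character vector
nchar(as.character(f))    # 3 3 : convert the factor first
```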

Extracting substrings

  • str_sub()
  • extracts parts of strings based on their location
  • first argument, string, is a vector of strings
  • the arguments start and end specify the boundaries of the piece to extract in characters
  • both start and end can be negative integers, in which case, they count from the end of the string
[1] "Bruc" "Wayn"
[1] "uce" "yne"
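Given the outputs above, the calls were presumably along these lines:

```r
library(stringr)

heroes <- c("Bruce", "Wayne")
str_sub(heroes, 1, 4)     # "Bruc" "Wayn"
str_sub(heroes, -3, -1)   # "uce" "yne" : negative indices count from the end
```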

Matches

  • str_detect(): answers the question: Does the string contain the pattern?
  • str_subset(): subsetting strings based on match
  • str_count(): counting matches
[1] FALSE  TRUE  TRUE
[1] "pepperoni"                 "sausage and green peppers"
[1] 0 1 1
[1] 3 2 5
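A sketch with an assumed vector of pizza toppings that reproduces the first three outputs (the last output presumably counted a different pattern):

```r
library(stringr)

pizzas <- c("cheese", "pepperoni", "sausage and green peppers")

str_detect(pizzas, "pepper")   # FALSE TRUE TRUE
str_subset(pizzas, "pepper")   # "pepperoni" "sausage and green peppers"
str_count(pizzas, "pepper")    # 0 1 1
```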

Parsing strings into variables

str_split(): pull apart raw string data into more useful variables

[[1]]
[1] "23.01.2017" "29.01.2017"

[[2]]
[1] "30.01.2017" "06.02.2017"
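The list output above is consistent with splitting date ranges on a separator (the exact input strings are assumptions):

```r
library(stringr)

ranges <- c("23.01.2017 - 29.01.2017", "30.01.2017 - 06.02.2017")
str_split(ranges, " - ")
# [[1]] "23.01.2017" "29.01.2017"
# [[2]] "30.01.2017" "06.02.2017"

# simplify = TRUE returns a character matrix instead of a list
str_split(ranges, " - ", simplify = TRUE)
```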

Replacing matches in strings

[1] "192" "118" "001"
[1] "510.555.0123" "541.555.0167"
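The second output looks like dashes replaced by dots in phone numbers; the inputs are assumptions. Note the difference between replacing the first match and all matches:

```r
library(stringr)

phones <- c("510-555-0123", "541-555-0167")
str_replace(phones, "-", ".")       # "510.555-0123" "541.555-0167" : first match only
str_replace_all(phones, "-", ".")   # "510.555.0123" "541.555.0167" : every match
```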

Lab

  • open names.txt and copy the content
  • you’ll turn a vector of full names, like “Bruce Wayne”, into abbreviated names like “B. Wayne”. This requires combining str_split(), str_sub() and str_c().
  • do the task using str_split() with simplify = TRUE
  • calculate how many names end with a, h, s and e.

Regular expressions (Advanced)

Why Regular expressions

  • Most common tasks when working with strings:
  • Extracting a part of a string
  • Is the following string a number: "10202"?
  • Is there a number in a string: "102a"? What about in this "1O2"?
  • We would like to separate the following string into 3 columns: "2,32.1,0.4"!

Regular expressions

  • syntax to describe patterns
  • functions on patterns

grep

  • grep function from base
  • sub for replacement
  • stringr package
  • grep - global regular expression print. Is there a pattern in a string?
string <- "car"
pattern <- "car"
grep(pattern, string)
[1] 1
string <- c("car", "cars", "in a car", "truck", "car's trunk")
pattern <- "car"
grep(pattern, string)
[1] 1 2 3 5

grepl

  • grepl - returns logical value
string <- c("car", "cars", "in a car", "truck", "car's trunk")
pattern <- "car"
grepl(pattern, string)
[1]  TRUE  TRUE  TRUE FALSE  TRUE

Meta and special characters

  • special characters: . \ | ( ) [ { ^ $ * + ?
  • \ - escape character
  • . - any single character
  • ^ - beginning of a string
  • $ - end of a string
[1] 1 2 5
[1] 1
[1] 2
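With the string vector from the grep slide, the three outputs above are consistent with anchored patterns such as (the exact patterns are assumptions):

```r
string <- c("car", "cars", "in a car", "truck", "car's trunk")

grep("^car", string)    # 1 2 5 : starts with "car"
grep("^car$", string)   # 1     : is exactly "car"
grep("cars$", string)   # 2     : ends with "cars"
```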

Alphanumeric character

  • \w
[1] 2 3 4
[1] FALSE  TRUE  TRUE  TRUE FALSE FALSE

Non-alphanumeric character

  • \W
[1] 1 5 6
[1]  TRUE FALSE FALSE FALSE  TRUE  TRUE

Whitespace

  • \s
[1] 1 6

Non-whitespace

  • \S
[1] 2 3 4 5

Digit

  • \d
[1] 3

Non-digit

  • \D
[1] 1 2 4 5 6
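A single assumed test vector reproduces all of the outputs in the six slides above (remember that in R the classes are written with an escaped backslash: "\\w", "\\W", "\\s", "\\S", "\\d", "\\D"):

```r
s <- c(" ", "car", "99", "truck", "!!", "  ")

grep("\\w", s)    # 2 3 4     : contains a letter, digit or underscore
grepl("\\w", s)   # FALSE TRUE TRUE TRUE FALSE FALSE
grep("\\W", s)    # 1 5 6     : contains a non-alphanumeric character
grep("\\s", s)    # 1 6       : contains whitespace
grep("\\S", s)    # 2 3 4 5   : contains non-whitespace
grep("\\d", s)    # 3         : contains a digit
grep("\\D", s)    # 1 2 4 5 6 : contains a non-digit
```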

Possible values for a character

  • using []
[1] 1 2 4
[1] 1 2
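An illustrative (assumed) example of bracketed character sets, matching the shape of the outputs above:

```r
x <- c("grey", "gray", "groy", "greyish")

grep("gr[ea]y", x)     # 1 2 4 : 'e' or 'a' between "gr" and "y"
grep("^gr[ea]y$", x)   # 1 2   : the whole string must match
```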

All n-character lowercase words

[1] 4 7 8

One or two digits anywhere

  • | - alternation (or)
  • () - grouping
[1] 1 2 3 5 6

Exactly one or two digits

[1] 1 2 3
[1] 1 2 3 6
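An assumed example showing the difference between "one or two digits anywhere" and "exactly one or two digits" (anchors force the whole string to match):

```r
x <- c("1", "22", "5a", "abc")

grep("\\d{1,2}", x)         # 1 2 3 : contains one or two digits somewhere
grep("^\\d{1,2}$", x)       # 1 2   : the whole string is one or two digits
grep("^(\\d|\\d\\d)$", x)   # 1 2   : same idea with alternation and grouping
```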

Repeating (1)

  • repeating operators refer to last character or group
  • ? - matches at most 1 time
  • * - matches 0 or more times
  • + - matches 1 or more times
  • {m} - matches exactly m times
  • {m,n} - matches between m and n times
  • {m,} - matches at least m times
[1] 2 3 4 5 6
[1] 3 4 5 6

Repeating (2)

[1] "ab"  "acb"
[1] "accb"
[1] "accb"   "acccb"  "accccb"
[1] "accb"  "acccb"
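The value = TRUE outputs in Repeating (2) are fully reproduced by the following calls (the test vector is implied by the outputs themselves):

```r
string <- c("ab", "acb", "accb", "acccb", "accccb")

grep("ac?b",     string, value = TRUE)   # "ab" "acb"
grep("ac{2}b",   string, value = TRUE)   # "accb"
grep("ac{2,}b",  string, value = TRUE)   # "accb" "acccb" "accccb"
grep("ac{2,3}b", string, value = TRUE)   # "accb" "acccb"
```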

All lowercase-letter words

[1] 1 2

Words with letters and length from 3-5

[1] 1 7

Signed numbers

[1] 2 4

Greedy and Lazy Repetition

  • the repetition operators or quantifiers are greedy
  • how to make them lazy?
  • regexpr
  • regmatches
[1] "<EM>first</EM>"
[1] "<EM>"
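The two outputs above are reproduced by a greedy versus a lazy quantifier (the surrounding sentence is an assumption; perl = TRUE makes the lazy form unambiguous):

```r
x <- "This is the <EM>first</EM> tag."

regmatches(x, regexpr("<.+>", x))                 # "<EM>first</EM>" : greedy, longest match
regmatches(x, regexpr("<.+?>", x, perl = TRUE))   # "<EM>"           : lazy, shortest match
```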

gregexpr

  • finds all positions and lengths of matched patterns
[[1]]
[1] 17 46
attr(,"match.length")
[1] 3 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

regmatches

  • extract or replace matched substrings from match data obtained by regexpr, gregexpr or regexec
[[1]]
[1] "100" "45" 
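A sketch consistent with the outputs in these two slides (the sentence is an assumption, so the reported match positions will differ):

```r
x <- "The room cost 100 dollars, the breakfast another 45."

m <- gregexpr("[0-9]+", x)   # positions and lengths of all matches
regmatches(x, m)             # list: "100" "45"
```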

Lab

  • Open regular.txt and complete the exercises

Stringr and regular expressions

  • str_subset (grep)
 [1] "bell pepper"       "blood orange"      "canary melon"     
 [4] "chili pepper"      "goji berry"        "kiwi fruit"       
 [7] "purple mangosteen" "rock melon"        "salal berry"      
[10] "star fruit"        "ugli fruit"       
  • str_detect (grepl)
  • str_extract/str_extract_all (gregexpr+regmatches)
[1] "100"
[[1]]
[1] "100" "45" 
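The fruit listing above matches the multi-word entries of stringr's built-in fruit vector (the pattern " " is an assumption), and the extraction outputs follow the same number pattern as before (the sentence is likewise an assumption):

```r
library(stringr)

str_subset(fruit, " ")         # fruits whose name contains a space

x <- "The room cost 100 dollars, the breakfast another 45."
str_extract(x, "[0-9]+")       # "100"                : first match only
str_extract_all(x, "[0-9]+")   # list("100", "45")    : all matches
```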

Text mining (optional)

Data

Why text mining?

  • 85-90 percent of all corporate data is in some unstructured form (e.g., text)
  • unstructured corporate data is doubling in size every 18 months
  • the benefits of text mining are obvious, especially in text-rich environments, e.g.:
  • law (court orders)
  • academic research (research articles)
  • finance (quarterly reports)
  • medicine (discharge summaries)
  • marketing (customer comments)
  • electronic communication records (e.g., email): spam filtering, email prioritization and categorization, automatic response generation

Key steps

1. Collection of text document

2. Pre – processing of text

3. Text mining techniques

4. Analyze the text

5. Knowledge discovery

1. Collection of text document

  • web scraping
  • scanning and OCR
  • internal documents

2. Pre – processing of text

  • tokenization
  • removal of stop words
  • stemming

Preprocessing: tokenize and N-grams

  • N-grams are contiguous sequences of length n (of characters or words) in the source text

Tokenize - example

library(tibble)    # tibble()
library(tidytext)  # unnest_tokens()

text <- c("Great white shark just ate my leg.","Not a wonderful day and days!")
text_df <- tibble(id = 1:2, text = text)

text_df %>%
  unnest_tokens(word, text)
# A tibble: 13 × 2
      id word     
   <int> <chr>    
 1     1 great    
 2     1 white    
 3     1 shark    
 4     1 just     
 5     1 ate      
 6     1 my       
 7     1 leg      
 8     2 not      
 9     2 a        
10     2 wonderful
11     2 day      
12     2 and      
13     2 days     

Removal of stop words

  • the most common words in any language (like articles, prepositions, pronouns, conjunctions, etc)

  • not adding much information to the text

  • examples of a few stop words in English are “the”, “a”, “an”, “so”, “what”,…

  • why remove stop words? Removing this low-level information gives more focus to the important content

  • Do we always remove stop words? NO!

  • Before removing stop words, research a bit about your task and the problem you are trying to solve, and then make your decision!

Removal of stop words

data(stop_words)  # stop-word lexicon shipped with tidytext

text_df %>%
  unnest_tokens(word, text) %>% 
  anti_join(stop_words)
# A tibble: 7 × 2
     id word     
  <int> <chr>    
1     1 white    
2     1 shark    
3     1 ate      
4     1 leg      
5     2 wonderful
6     2 day      
7     2 days     

Stemming and lemmatization

  • Stemming: the process of reducing a word to its stem or root format
  • Lemmatization: the transformation that uses a dictionary to map a word’s variant back to its root format

Stemming and lemmatization

Stemming and lemmatization - stemming

library(SnowballC)  # wordStem()

text_df %>%
  unnest_tokens(word, text) %>% 
  mutate(word=wordStem(word))
# A tibble: 13 × 2
      id word  
   <int> <chr> 
 1     1 great 
 2     1 white 
 3     1 shark 
 4     1 just  
 5     1 at    
 6     1 my    
 7     1 leg   
 8     2 not   
 9     2 a     
10     2 wonder
11     2 dai   
12     2 and   
13     2 dai   

Stemming and lemmatization - lemmatization

library(textstem)  # lemmatize_words()

text_df %>%
  unnest_tokens(word, text) %>% 
  mutate(word=lemmatize_words(word))
# A tibble: 13 × 2
      id word     
   <int> <chr>    
 1     1 great    
 2     1 white    
 3     1 shark    
 4     1 just     
 5     1 eat      
 6     1 my       
 7     1 leg      
 8     2 not      
 9     2 a        
10     2 wonderful
11     2 day      
12     2 and      
13     2 day      

3. Text mining techniques

Concepts

  • bag of words
  • NLP

Bag of words

  • mostly used technique
  • every word is independent (mostly)
  • stemming/lemmatization (tourist, tourists and tourism may be the same word)
  • word frequency
  • POS
  • sentiment analysis (different lexicons)
  • entities extraction
  • topics identification (e.g. LDA algorithm)

NLP

  • uses dictionaries to learn (e.g. Stanford NLP)
  • a subfield of artificial intelligence and computational linguistics: the study of “understanding” natural human language

Demo

Stanford NLP

Sentiment analysis

  • lexicons
  • simple (e.g. Liu & Hu): -1 negative, 0 neutral, +1 positive
  • advanced (e.g. AFINN): -5<–>+5

How to do it in R - 1

  • many packages:

    • tm, LDA, textmineR, tidyr, sentimentr,…
library(readxl)      # read_xlsx()
library(sentimentr)  # get_sentences(), sentiment_by()

data <- read_xlsx("TA_reviews.xlsx")
sentences <- get_sentences(data$fullrev[1:2])
sentences
[[1]]
[1] "The hotel is ideally located and is in a beautiful building."                                                                        
[2] "Most of the staff are very polite and helpful."                                                                                      
[3] "Rooms are comfortable and it has a serviceable gym."                                                                                 
[4] "Avoid going to breakfast before 0700 or wearing flip flops or slippers, you will be admonished and sent back to your room to change."

[[2]]
[1] "The hotel is a short walk to the pedestrian mall, restaurants and cafes."    
[2] "The hotel is an old historical landmark."                                    
[3] "I loved the tall ceilings, lobby and restaurant."                            
[4] "The bathroom has been updated and is very nice."                             
[5] "The breakfast buffet is very good with many options and you can eat outside."
[6] "We enjoyed our stay here."                                                   

attr(,"class")
[1] "get_sentences"           "get_sentences_character"
[3] "list"                   

How to do it in R - 2

dfJR<-sentiment_by(sentences,polarity_dt=lexicon::hash_sentiment_jockers_rinker)
dfSE<-sentiment_by(sentences,polarity_dt=lexicon::hash_sentiment_sentiword)
dfJR
   element_id word_count        sd ave_sentiment
1:          1         52 0.3861717     0.2993681
2:          2         56 0.2501910     0.2914671
dfSE
   element_id word_count        sd ave_sentiment
1:          1         52 0.1763900    0.14263246
2:          2         56 0.1420433    0.02832483

How to do it in R - 3

sentences<-"The great white shark just ate my leg!"

dfJR<-sentiment_by(sentences,polarity_dt=lexicon::hash_sentiment_jockers_rinker)
dfSE<-sentiment_by(sentences,polarity_dt=lexicon::hash_sentiment_sentiword)
dfJR
   element_id word_count sd ave_sentiment
1:          1          8 NA   -0.03535534
dfSE
   element_id word_count sd ave_sentiment
1:          1          8 NA   -0.04787702

How to deal with datetime

Creating date/times - 1

  • the default format for date is yyyy-mm-dd
  • the default format for time is hh:mm:ss
  • a date-time is a date plus a time
  • native class for storing time: hms package
[1] "2019-05-05 18:51:32 CEST"
[1] "2019-05-05 18:51:32 CEST"
[1] "2019-05-05 18:51:32 UTC"

Creating date/times - 2

  • to get the current date or date-time you can use today() or now() - same as in Excel
  • from a string, from individual date-time components, from an existing date/time object
[1] "2024-01-22"
[1] "2024-01-22"
[1] "2024-01-22 14:02:52 CET"
[1] "2024-01-22 14:02:52 CET"

From strings - 1

  • using lubridate
[1] "2017-01-31"
[1] NA
[1] "2017-01-31"
[1] "2018-10-17"
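The four outputs above are consistent with calls like the following (the exact input strings are assumptions; an impossible date parses to NA with a warning):

```r
library(lubridate)

ymd("2017-01-31")   # "2017-01-31"
ymd("2017-31-01")   # NA : 31 is not a valid month
dmy("31.01.2017")   # "2017-01-31"
dmy("17/10/2018")   # "2018-10-17"
```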

From strings - 2

[1] "2017-01-31 20:11:59 UTC"
[1] "2017-01-31 08:01:00 UTC"
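The two date-time outputs above follow from the parser variants with a time component (inputs are assumptions):

```r
library(lubridate)

ymd_hms("2017-01-31 20:11:59")   # "2017-01-31 20:11:59 UTC"
ymd_hm("2017-01-31 08:01")       # "2017-01-31 08:01:00 UTC"
```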

From strings - 3

Individual components

  • make_date()
  • make_datetime()
[1] "2007-11-05"
[1] "2007-11-05 15:07:00 UTC"
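The outputs above correspond to building dates and date-times from individual numeric components:

```r
library(lubridate)

make_date(2007, 11, 5)              # "2007-11-05"
make_datetime(2007, 11, 5, 15, 7)   # "2007-11-05 15:07:00 UTC"
```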

From other types

  • as_datetime()
  • as_date()
[1] "2024-01-22 UTC"
[1] "2024-01-22"

Lab

  • Use the appropriate lubridate function to parse each of the following dates:

Date-time components - 1

[1] 2019
[1] 5
[1] 5
[1] 125
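The outputs above match pulling components out of a date such as (assumed) 2019-05-05:

```r
library(lubridate)

d <- ymd("2019-05-05")
year(d)    # 2019
month(d)   # 5
day(d)     # 5
yday(d)    # 125 : day of the year
```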

Date-time components - 2

[1] 1
[1] 19
[1] 23
[1] 13
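These outputs are consistent with a date-time such as (assumed) 2019-05-05 19:23:13:

```r
library(lubridate)

dt <- ymd_hms("2019-05-05 19:23:13")
wday(dt)     # 1 : Sunday (weeks start on Sunday by default)
hour(dt)     # 19
minute(dt)   # 23
second(dt)   # 13
```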

Date-time components - 3

  • for month() and wday() you can set label = TRUE
  • abbr = FALSE to return the full name
[1] May
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
[1] Sunday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
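The labelled outputs above come from the same components with label = TRUE (the date is assumed):

```r
library(lubridate)

d <- ymd("2019-05-05")
month(d, label = TRUE)                # May : ordered factor Jan < ... < Dec
wday(d, label = TRUE, abbr = FALSE)   # Sunday : full names instead of abbreviations
```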

Lab

  • On which day of the week were you born?
  • On which day of the week will you celebrate your 40th birthday?

Durations - 1

  • when you subtract two dates, you get a difftime object
  • a difftime class object records a time span of seconds, minutes, hours, days, or weeks
  • lubridate provides an alternative which always uses seconds: the duration
Time difference of 17665 days
[1] "1526256000s (~48.36 years)"

Durations - 2

[1] "15s"
[1] "600s (~10 minutes)"
[1] "43200s (~12 hours)" "86400s (~1 days)"  

Durations - 3

[1] "0s"                "86400s (~1 days)"  "172800s (~2 days)"
[4] "259200s (~3 days)" "345600s (~4 days)" "432000s (~5 days)"
[1] "1814400s (~3 weeks)"
[1] "31557600s (~1 years)"
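The duration outputs in the two slides above come from the d-prefixed constructors, which always count in seconds:

```r
library(lubridate)

dseconds(15)        # "15s"
dminutes(10)        # "600s (~10 minutes)"
dhours(c(12, 24))   # "43200s (~12 hours)" "86400s (~1 days)"
ddays(0:5)          # "0s" "86400s (~1 days)" ... "432000s (~5 days)"
dweeks(3)           # "1814400s (~3 weeks)"
dyears(1)           # "31557600s (~1 years)"
```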

Periods - 1

  • periods are time spans but don’t have a fixed length in seconds, instead they work with “human” times, like days and months
[1] "15S"
[1] "10M 0S"
[1] "12H 0M 0S" "24H 0M 0S"
[1] "7d 0H 0M 0S"

Periods - 2

[1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S" "4m 0d 0H 0M 0S"
[5] "5m 0d 0H 0M 0S" "6m 0d 0H 0M 0S"
[1] "21d 0H 0M 0S"
[1] "1y 0m 0d 0H 0M 0S"
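The period outputs in the two slides above come from the plural constructors without the d prefix:

```r
library(lubridate)

seconds(15)        # "15S"
minutes(10)        # "10M 0S"
hours(c(12, 24))   # "12H 0M 0S" "24H 0M 0S"
days(7)            # "7d 0H 0M 0S"
months(1:6)        # "1m 0d 0H 0M 0S" ... "6m 0d 0H 0M 0S"
weeks(3)           # "21d 0H 0M 0S"
years(1)           # "1y 0m 0d 0H 0M 0S"
```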

Periods - 3

  • you can add periods to dates
[1] "2019-10-19 06:00:00 UTC"
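The output above is consistent with adding periods to a date-time (the starting value is an assumption):

```r
library(lubridate)

ymd_hms("2019-10-12 00:00:00") + weeks(1) + hours(6)
# "2019-10-19 06:00:00 UTC"
```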

Periods - 4

  • convert a given number of seconds to a period, and back
[1] "1116d 13H 2M 53.4735701084137S"
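A smaller worked example of the conversion (the number shown on the slide was presumably larger):

```r
library(lubridate)

seconds_to_period(100000)   # "1d 3H 46M 40S" : 86400 + 3*3600 + 46*60 + 40
```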

Lab

  • import NYCTaxi.xlsx from link
  • convert pickup_datetime, dropoff_datetime to datetime format (e.g. dataNYC$pickup_datetime<-ymd_hms(dataNYC$pickup_datetime))
  • calculate the mean duration drive
  • use seconds_to_period function

as.Date - 1

  • %Y: 4-digit year (1982)
  • %y: 2-digit year (82)
  • %m: 2-digit month (01)
  • %d: 2-digit day of the month (13)
  • %A: weekday (Wednesday)
  • %a: abbreviated weekday (Wed)
  • %B: month (January)
  • %b: abbreviated month (Jan)

as.Date - 2

[1] "2008-09-28"
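The output is consistent with a format string built from the codes above (the input string is an assumption; %b depends on the current locale):

```r
as.Date("Sep 28, 2008", format = "%b %d, %Y")   # "2008-09-28" in an English locale
as.Date("28/09/2008", format = "%d/%m/%Y")      # "2008-09-28" : locale-independent
```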

Locale - 1

  • change locale
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
[1] "01-03-18"
[1] "01-Mrz-2018"
[1] "01-März-18"
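The German outputs above can be reproduced along these lines (the date value is an assumption, and the locale name "German" is Windows-style; on Linux use e.g. "de_DE.UTF-8"):

```r
d <- as.Date("2018-03-01")

Sys.setlocale("LC_TIME", "German")   # locale name depends on the operating system
format(d, "%d-%m-%y")   # "01-03-18"
format(d, "%d-%b-%Y")   # "01-Mrz-2018"  (abbreviated German month)
format(d, "%d-%B-%y")   # "01-März-18"   (full German month)
```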

Locale - 2

[1] "Mrz 01, 2018"
[1] "März 01, 2018"
[1] Donnerstag
7 Levels: Sonntag < Montag < Dienstag < Mittwoch < Donnerstag < ... < Samstag
[1] ""

Lab

Use Sys.getlocale() and Sys.setlocale() to:

  • display today’s month in Czech
  • display today’s day in Swedish

Lab

In this exercise you will work with the date, “1930-08-30”, Warren Buffett’s birth date! Mind the locale language!

  • Use as.Date() and an appropriate format to convert “08,30,1930” to a date (it is in the form of “month,day,year”)
  • Use as.Date() and an appropriate format to convert “Aug 30,1930” to a date
  • Use as.Date() and an appropriate format to convert “30aug1930” to a date
  • also solve previous tasks with lubridate functions

Update object

[1] "2010-01-01"
[1] "2009-02-10 00:10:03 UTC"
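The two outputs are consistent with update() calls on a date and a date-time (the starting values are assumptions):

```r
library(lubridate)

update(ymd("2009-03-15"), year = 2010, month = 1, mday = 1)
# "2010-01-01"

update(ymd_hms("2009-02-10 10:00:00"), hour = 0, minute = 10, second = 3)
# "2009-02-10 00:10:03 UTC"
```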